Predicting Stroke Risk from Common Health Indicators

Shree Krishna M.S Basnet
Supervisor: Dr. Cohen

2025-12-02

Stroke analysis and early prediction

  • Stroke often strikes without warning, can be fatal, and is partly preventable. It is a major cause of death and long-term disability; a single event can leave a person paralysed for life.
  • Known risk factors include hypertension, high glucose / diabetes, heart disease, smoking, and obesity (BMI).
  • These factors are already recorded in routine care (age, blood-pressure history, glucose, BMI, smoking status) but are often overlooked without in-depth analysis.
  • Motivation for analysis: use these routine measurements to build an interpretable logistic regression model, and compare it with machine-learning methods, to help identify high-risk patients for prevention and counselling.

Research motivation & Objectives

  • Which common health indicators are most strongly associated with stroke risk?
  • How well can a binary logistic regression model separate stroke cases from non-stroke?
  • Do more complex machine learning models (Decision Tree, Random Forest, GBM, kNN, SVM) give better or more reliable results than logistic regression?
  • How does class imbalance (rare stroke outcome) affect model performance and choice of probability threshold?

Objectives

  • Build an interpretable logistic regression model for stroke prediction.
  • Compare performance with several ML classifiers on the same train/test split.
  • Evaluate using Accuracy, Sensitivity, Specificity, ROC, AUC, and Youden’s J.
  • Explore threshold tuning to improve detection of stroke cases.

Data & Key Variables

1. Dataset

  • Public stroke dataset (Kaggle).
  • N = 3,357 patients after cleaning and removing missing values.
  • Binary outcome: Stroke (Yes / No).


2. Main predictors

  • Demographic: Age, Gender, Ever_married, Work_type, Residence_type.
  • Clinical: Hypertension, Heart_disease.
  • Behavioural / metabolic: Smoking_status, BMI, Average glucose level.


3. Coding

  • Categorical variables recoded to numeric levels (1, 2, 3, …).
  • Outcome stroke coded as factor with levels No and Yes.
  • Non-predictive ID column removed.

Class Imbalance in Stroke Outcome

1. Outcome distribution

  • Stroke (Yes): ≈ 5–6% of patients
  • No stroke (No): ≈ 94–95% of patients

2. Why this matters

  • A model can reach > 90% accuracy by predicting “No stroke.”
  • High accuracy does not mean good stroke detection.
  • Must evaluate using Sensitivity, Specificity, and AUC.

3. Evaluation focus

  • Sensitivity: detects stroke cases (true positives)
  • Specificity: detects non-stroke cases
  • Precision: correctness of predicted stroke cases
  • AUC: overall ranking ability
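
The accuracy trap described above can be checked with the outcome counts reported for the cleaned dataset (3,177 No, 180 Yes). The following is a minimal Python sketch, written only for illustration, that scores a hypothetical rule which always predicts "No stroke"; the rule is an assumption used to make the point, not one of the fitted models.

```python
# Outcome counts reported for the cleaned dataset.
n_no, n_yes = 3177, 180

# Hypothetical baseline: predict "No stroke" for every patient.
true_negatives = n_no     # every non-stroke patient is labelled correctly
false_negatives = n_yes   # every stroke patient is missed
true_positives = 0        # no stroke case is ever flagged

accuracy = true_negatives / (n_no + n_yes)
sensitivity = true_positives / n_yes

print(f"accuracy    = {accuracy:.3f}")
print(f"sensitivity = {sensitivity:.3f}")
```

Accuracy comes out at about 94.6% even though no stroke case is detected, which is why accuracy alone is not used to judge the models.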

Data Preparation

  • Removed rows with missing or inconsistent values (e.g., “Unknown”, “N/A”).
  • Converted variables to numeric / factor formats appropriate for modelling.
  • Checked ranges of Age, BMI, and Glucose for obvious data errors or outliers.
  • Defined strokeclean as the final cleaned dataset used in modelling.
  • Used a 70% training / 30% test split to assess out-of-sample performance.
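
With a rare outcome, a split that keeps the stroke rate similar in both parts makes test-set metrics easier to interpret. The slides do not say whether the 70/30 split was stratified, so the sketch below is an assumption: a plain-Python stratified split applied to synthetic labels that only reuse the reported class counts. The function name `stratified_split` and the seed are hypothetical.

```python
import random

def stratified_split(labels, train_frac=0.7, seed=42):
    """Return (train_idx, test_idx), sampling train_frac of each class."""
    rng = random.Random(seed)
    by_class = {}
    for i, y in enumerate(labels):
        by_class.setdefault(y, []).append(i)

    train_idx, test_idx = [], []
    for idx in by_class.values():
        rng.shuffle(idx)                       # random order within the class
        cut = round(train_frac * len(idx))     # size of the training share
        train_idx.extend(idx[:cut])
        test_idx.extend(idx[cut:])
    return sorted(train_idx), sorted(test_idx)

# Synthetic labels with the class counts reported for the cleaned dataset.
labels = ["No"] * 3177 + ["Yes"] * 180
train_idx, test_idx = stratified_split(labels)

train_rate = sum(labels[i] == "Yes" for i in train_idx) / len(train_idx)
test_rate = sum(labels[i] == "Yes" for i in test_idx) / len(test_idx)
print(len(train_idx), len(test_idx), round(train_rate, 4), round(test_rate, 4))
```

Because each class is split separately, the stroke rate in the training and test parts stays close to the overall rate.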

Exploratory Data Analysis (Overview)

Our goal is to understand how key health indicators differ between patients with and without stroke.


We first examine:

    1. Distributions of Age, BMI, and Average Glucose
    2. Stroke rates across Hypertension, Heart Disease, Smoking
    3. Correlations among numeric predictors


These insights help identify which predictors may have the strongest relationship with stroke risk.

Density Plots for Age, BMI, and Glucose

Figure 1

Interpretation of Key Continuous Predictors

  • Age
    • Smooth distribution from late teens.
    • Majority in 40–70 = highest-risk band.
    • No sharp clusters = good continuous predictor.
  • Average Glucose Level
    • Clear right-skew; tail extends beyond 200 mg/dL.
    • Small subgroup with metabolic issues (likely diabetic).
    • Strong link to cardiovascular and stroke risk.
  • BMI
    • Compact range (~22–35) with few outliers.
    • Less variation = weaker predictive power.
    • Consistent with medical evidence and our model.

Stroke Prevalence by Categorical Predictors

Figure 2: Sample composition by gender, residence type, and smoking status.

Interpretation of Key Categorical Predictors

  • Gender
    • More females (61%) than males (39%).
    • Imbalance may influence how gender appears in the model.
  • Residence Type
    • Nearly equal Urban (51%) and Rural (49%) split.
    • No major geographic bias in the sample.
  • Smoking Status
    • Never smokers (54%) form the majority.
    • Former (25%) and current (22%) smokers well represented.
    • Enough variation to assess smoking as a stroke risk factor.

Stroke Risk for Clinical and Behavioral Predictors

Figure 3: Stroke percentages (95% CI) by hypertension, heart disease, and smoking status.

Interpretation of stroke risk vs predictors

  • Hypertension
    • Higher stroke risk in hypertensive patients.
    • Clear CI separation → strong association.
  • Heart Disease
    • Higher stroke percentages among those with heart disease.
    • Consistent with known cardiovascular risk patterns.
  • Smoking Status
    • Former and current smokers show higher stroke risk.
    • Reflects long-term vascular impact of tobacco exposure.
  • Overall
    • Vascular risks (hypertension, heart disease) and behavioural risk (smoking) all show elevated stroke likelihood. Clinically consistent.
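
Figure 3 reports stroke percentages with 95% confidence intervals, but the slides do not state how the intervals were computed. As an illustration only, the sketch below uses the normal-approximation (Wald) interval for a proportion, which is an assumption. It is applied to the overall counts (180 strokes among 3,357 patients) because the per-group counts are not given here.

```python
import math

def wald_ci(events, n, z=1.96):
    """Proportion with a normal-approximation 95% confidence interval."""
    p = events / n
    se = math.sqrt(p * (1 - p) / n)   # standard error of the proportion
    return p, p - z * se, p + z * se

# Overall stroke counts in the cleaned dataset.
p, lower, upper = wald_ci(180, 3357)
print(f"{100 * p:.1f}% (95% CI {100 * lower:.1f}% to {100 * upper:.1f}%)")
```

The same function could be applied to each subgroup (for example hypertensive vs non-hypertensive patients) once its counts are known; smaller groups give wider intervals.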

Correlation among key numeric predictors

Interpretation — Correlation Heatmap

  • Overall
    • Correlations are weak–moderate (0.00–0.26).
    • No multicollinearity concerns.
  • Age
    • Positive links with glucose (0.24), hypertension (0.26), heart disease (0.26).
    • Matches aging-related cardiovascular risk trends.
  • BMI
    • Very weak correlations (0.04–0.16).
    • Acts independently in this dataset.
  • Average Glucose
    • Weak links with hypertension (0.17) and heart disease (0.14).
    • Consistent with metabolic–vascular patterns.
  • Hypertension & Heart Disease
    • Weak correlation (0.11) → related but not redundant.

Odds ratios and confidence intervals


Call:
glm(formula = stroke ~ age + hypertension + heart_disease + avg_glucose_level + 
    bmi + smoking_status + gender + ever_married, family = binomial(link = "logit"), 
    data = stroke_train)

Coefficients:
                   Estimate Std. Error z value Pr(>|z|)    
(Intercept)       -8.924079   0.992603  -8.991   <2e-16 ***
age                0.072637   0.008089   8.979   <2e-16 ***
hypertension       0.455377   0.229005   1.988   0.0468 *  
heart_disease      0.487385   0.270707   1.800   0.0718 .  
avg_glucose_level  0.003777   0.001705   2.215   0.0267 *  
bmi                0.006536   0.015709   0.416   0.6774    
smoking_status     0.234263   0.129934   1.803   0.0714 .  
gender             0.230592   0.206934   1.114   0.2651    
ever_married       0.118496   0.311030   0.381   0.7032    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 953.42  on 2348  degrees of freedom
Residual deviance: 776.77  on 2340  degrees of freedom
AIC: 794.77

Number of Fisher Scoring iterations: 7

Odds ratios and confidence intervals

                     OR 2.5 % 97.5 %
(Intercept)       0.000 0.000  0.001
age               1.075 1.059  1.093
hypertension      1.577 0.996  2.450
heart_disease     1.628 0.942  2.732
avg_glucose_level 1.004 1.000  1.007
bmi               1.007 0.975  1.037
smoking_status    1.264 0.978  1.629
gender            1.259 0.843  1.901
ever_married      1.126 0.590  2.013

Interpretation — Odds Ratios (Logistic Regression)

  • Age (OR 1.075, CI 1.059–1.093)
    Strongest predictor; each year ↑ stroke odds ~7.5%.

  • Hypertension (OR 1.577, CI 0.996–2.450)
    ~58% higher odds; borderline but clinically meaningful.

  • Heart disease (OR 1.628, CI 0.942–2.732)
    ~63% higher odds; CI includes 1 → not statistically strong.

  • Avg glucose (OR 1.004, CI 1.000–1.007)
    Slight ↑ in odds; marginal significance; aligns with metabolic risk.

  • BMI (OR 1.007, CI 0.975–1.037)
    No meaningful effect; CI overlaps 1.

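  • Smoking (OR 1.264 per level, CI 0.978–1.629)
    Former: OR ≈ 1.26 (weak).
    Current: OR ≈ 1.60 (suggestive ↑ risk).
    smoking_status enters the model as one numeric code, so these values are the per-level OR and its square; the CI includes 1.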

  • Gender (Female) (OR 1.259, CI 0.843–1.901)
    Slight ↑ odds; not significant.

  • Ever married (OR 1.126, CI 0.590–2.013)
    No clear effect.
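
The odds ratios above are the exponentials of the logistic regression coefficients. The sketch below recomputes a few of them from the estimates and standard errors in the model summary. The Wald intervals (estimate ± 1.96 × SE) are an assumption for illustration; the slides do not state which interval method produced the table, so small differences are expected.

```python
import math

# (estimate, standard error) copied from the model summary.
coefs = {
    "age": (0.072637, 0.008089),
    "hypertension": (0.455377, 0.229005),
    "heart_disease": (0.487385, 0.270707),
    "smoking_status": (0.234263, 0.129934),
}

def odds_ratio(beta, steps=1):
    """Odds ratio for a change of `steps` units in the predictor."""
    return math.exp(beta * steps)

def wald_ci(beta, se, z=1.96):
    """Approximate 95% interval for the odds ratio (Wald method)."""
    return math.exp(beta - z * se), math.exp(beta + z * se)

for name, (beta, se) in coefs.items():
    low, high = wald_ci(beta, se)
    print(f"{name:15s} OR={odds_ratio(beta):.3f}  Wald CI=({low:.3f}, {high:.3f})")

# Effects compound multiplicatively across units of a numeric predictor.
ten_years = odds_ratio(coefs["age"][0], steps=10)
two_levels = odds_ratio(coefs["smoking_status"][0], steps=2)
print(f"10 years of age: OR={ten_years:.2f}; two smoking levels: OR={two_levels:.3f}")
```

Because the effect is multiplicative, a 7.5% increase per year compounds to roughly a doubling of the odds over ten years, and two steps on the numeric smoking code give 1.264² ≈ 1.598.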

Logistic Regression performance (Test Set)

Confusion Matrix and Statistics

          Reference
Prediction  No Yes
       No  949  58
       Yes   0   1
                                         
               Accuracy : 0.9425         
                 95% CI : (0.9262, 0.956)
    No Information Rate : 0.9415         
    P-Value [Acc > NIR] : 0.4811         
                                         
                  Kappa : 0.0314         
                                         
 Mcnemar's Test P-Value : 7.184e-14      
                                         
            Sensitivity : 0.0169492      
            Specificity : 1.0000000      
         Pos Pred Value : 1.0000000      
         Neg Pred Value : 0.9424032      
             Prevalence : 0.0585317      
         Detection Rate : 0.0009921      
   Detection Prevalence : 0.0009921      
      Balanced Accuracy : 0.5084746      
                                         
       'Positive' Class : Yes            
                                         

Interpretation — Logistic Regression Performance (Test Set)

  • Accuracy: 94.25%
    High due to class imbalance (only ~6% stroke cases); not meaningful for detection.

  • Sensitivity: 0.017
    Detected 1 of 59 stroke cases → model fails to identify strokes.

  • Specificity: 1.00
    Perfect for non-stroke cases; predicts “No stroke” extremely well.

  • Precision (PPV): 1.00
    When predicting “Yes,” it was correct — but it predicted yes only once → misleadingly high.

  • NPV: 0.942
    Most “No” predictions are correct, consistent with majority class.

  • Kappa: 0.031
    Near zero → model performs only slightly better than random for imbalanced data.

  • Balanced Accuracy: 0.508
    Equivalent to chance level when sensitivity and specificity are weighted equally.

  • McNemar’s Test: p < 0.0001
    Errors are systematically biased toward predicting “No stroke.”

Overall:
Model performs well for non-stroke patients but fails severely for stroke detection.
Class imbalance dominates performance → requires resampling, class weights, or other imbalance-handling methods.
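
The statistics in the confusion-matrix output follow directly from its four cell counts. As a cross-check, the sketch below recomputes accuracy, sensitivity, specificity, balanced accuracy, and Cohen's kappa from the counts shown above (949, 58, 0, 1). The function name `confusion_metrics` is hypothetical.

```python
def confusion_metrics(tn, fn, fp, tp):
    """Summary statistics for a 2x2 confusion matrix ('Yes' = positive)."""
    n = tn + fn + fp + tp
    accuracy = (tn + tp) / n
    sensitivity = tp / (tp + fn)          # share of stroke cases detected
    specificity = tn / (tn + fp)          # share of non-stroke cases detected
    # Agreement expected by chance, from the row and column totals.
    expected = ((tn + fn) * (tn + fp) + (fp + tp) * (fn + tp)) / n**2
    kappa = (accuracy - expected) / (1 - expected)
    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "balanced_accuracy": (sensitivity + specificity) / 2,
        "kappa": kappa,
    }

# Test-set counts at the 0.5 threshold, read from the confusion matrix.
metrics = confusion_metrics(tn=949, fn=58, fp=0, tp=1)
for name, value in metrics.items():
    print(f"{name:18s} {value:.4f}")
```

Kappa is close to zero because the observed agreement (about 0.942) is barely above the agreement expected from the class proportions alone (about 0.941).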

ROC curve and AUC for the logistic model

Interpretation — ROC Curve & AUC

  • AUC = 0.815 → indicates good discriminative ability.
    (0.5 = none, 0.7–0.8 = acceptable, 0.8–0.9 = good, >0.9 = excellent)

  • ROC evaluates performance across all thresholds, not just 0.5.

  • Despite poor sensitivity at the 0.5 cutoff,
    the AUC shows the model can separate stroke vs non-stroke reasonably well if a better threshold is chosen.

  • The gap between strong AUC and weak sensitivity reflects:
    • severe class imbalance
    • the need for customized probability cutoffs in medical prediction

Implications for improvement:

  • Threshold tuning
  • Cost-sensitive / class-weighted training
  • Resampling (e.g., SMOTE, oversampling)
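
The gap between AUC and sensitivity can be reproduced on a toy example. AUC equals the probability that a randomly chosen stroke patient receives a higher predicted probability than a randomly chosen non-stroke patient, so it depends only on ranking, not on any cutoff. The scores below are hypothetical and are not taken from the study.

```python
def auc(pos_scores, neg_scores):
    """Probability that a random positive outranks a random negative
    (ties count as half)."""
    wins = 0.0
    for p in pos_scores:
        for q in neg_scores:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))

def sensitivity(pos_scores, threshold):
    """Share of positives whose score reaches the threshold."""
    return sum(p >= threshold for p in pos_scores) / len(pos_scores)

# Hypothetical predicted probabilities: all low, as expected for a rare outcome.
stroke = [0.30, 0.12]
no_stroke = [0.05, 0.10, 0.20, 0.08]

toy_auc = auc(stroke, no_stroke)
sens_default = sensitivity(stroke, 0.5)
sens_lowered = sensitivity(stroke, 0.1)
print(toy_auc, sens_default, sens_lowered)
```

In this toy case the ranking is good (7 of 8 stroke/non-stroke pairs are ordered correctly), yet no stroke score reaches 0.5, so sensitivity at the default cutoff is zero until the threshold is lowered.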


Stroke outcome counts (cleaned dataset, N = 3,357):

  No  Yes 
3177  180 

Machine-learning model comparison

Model  AUC        Accuracy   Sensitivity  Specificity
LR     0.7793712  0.9433962  0.00000000   0.9968520
TREE   0.6475263  0.9414101  0.01851852   0.9937041
RF     0.7250496  0.9463754  0.00000000   1.0000000
GBM    0.7592884  0.9433962  0.00000000   0.9968520
KNN    0.6668998  0.9463754  0.00000000   1.0000000
SVM    0.6390929  0.9414101  0.00000000   0.9947534

Interpretation — Model Comparison

  • All six models show very high accuracy & specificity, driven by the dataset’s ~6% stroke rate (strong class imbalance).

  • Sensitivity is extremely low for every model → almost none correctly identify stroke cases.

  • Best AUC values:
    • Logistic Regression: 0.78
    • GBM: 0.76
    → These models discriminate high- vs low-risk patients better than others, despite poor sensitivity at the 0.5 threshold.

  • Only the Decision Tree shows nonzero sensitivity (~1.9%); LR, RF, and GBM detect no stroke cases at the 0.5 threshold.

  • KNN and SVM also detect 0 stroke cases at the default threshold, despite high accuracy.

  • Overall: Accuracy is misleading; all models excel at predicting “No stroke” but fail at detecting positives.

  • Confirms severe class imbalance → meaningful improvement requires:
    • threshold tuning,
    • resampling (SMOTE/oversampling),
    • cost-sensitive learning.
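
To give a sense of scale for cost-sensitive learning, the sketch below computes inverse-frequency class weights from the reported outcome counts. This "balanced" heuristic (total / (number of classes × class count)) is one common choice and is an assumption here; the slides do not report that any weighting was applied.

```python
def balanced_class_weights(counts):
    """Inverse-frequency weights: n_total / (n_classes * n_class)."""
    n_total = sum(counts.values())
    n_classes = len(counts)
    return {label: n_total / (n_classes * n) for label, n in counts.items()}

# Outcome counts reported for the cleaned dataset.
counts = {"No": 3177, "Yes": 180}
weights = balanced_class_weights(counts)

for label, w in weights.items():
    # Weighted totals are equal across classes by construction.
    print(f"{label:3s} weight={w:.3f}  weighted total={w * counts[label]:.1f}")
```

Under this heuristic each stroke case would count roughly 18 times as much as a non-stroke case during training, so that both classes contribute equally to the loss.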

ROC comparison for all 6 models

Interpretation — ROC Comparison Across Models

  • Logistic Regression (AUC = 0.779) shows the best overall discrimination between stroke vs non-stroke.

  • GBM (AUC = 0.759) and Random Forest (AUC = 0.725) also perform well, close to LR.

  • KNN (AUC = 0.667) performs moderately — better than chance, but weaker than LR and tree ensembles.

  • Decision Tree (AUC = 0.648) and SVM (AUC = 0.639) have the lowest AUC values → weakest discrimination.

  • All models score above 0.5, meaning they perform better than random guessing, but with clear differences in quality.

  • ROC curves show that LR, RF, and GBM extract the strongest predictive patterns, outperforming simpler tree models and distance-based/SVM methods.

Overall: Logistic Regression is the most stable and best-performing model for this dataset, despite class imbalance.

Odds ratios and risk stratification

Interpretation — Forest Plot (Odds Ratios)

  • Hypertension
    Strongest predictor. OR > 2 with CI fully above 1 → hypertensive patients have more than double the odds of stroke.

  • Age (per year)
    OR slightly > 1 with a narrow CI above 1 → each year adds a consistent increase in stroke risk.

  • Average glucose level
    OR just above 1 with CI above 1 → higher glucose gives a modest but reliable rise in stroke risk.

  • Other predictors (ever married, heart disease, smoking, gender, BMI, residence, work type)
    CIs cross 1 → not statistically significant after adjustment.
    Some ORs are > 1 (e.g., heart disease, smoking), suggesting possible risk, but evidence is weak in this dataset.

Overall:
Hypertension, older age, and higher glucose are the clearest independent predictors of stroke.
Other factors show smaller or uncertain effects.
This pattern aligns with known clinical risk factors and supports the logistic model’s value for risk stratification.

Threshold tuning to 0.2 from 0.5

Confusion Matrix and Statistics

          Reference
Prediction  No Yes
       No  903  46
       Yes  46  13
                                          
               Accuracy : 0.9087          
                 95% CI : (0.8892, 0.9258)
    No Information Rate : 0.9415          
    P-Value [Acc > NIR] : 1               
                                          
                  Kappa : 0.1719          
                                          
 Mcnemar's Test P-Value : 1               
                                          
            Sensitivity : 0.22034         
            Specificity : 0.95153         
         Pos Pred Value : 0.22034         
         Neg Pred Value : 0.95153         
             Prevalence : 0.05853         
         Detection Rate : 0.01290         
   Detection Prevalence : 0.05853         
      Balanced Accuracy : 0.58593         
                                          
       'Positive' Class : Yes             
                                          

Interpretation — Threshold = 0.2

  • Sensitivity improves from 1 stroke detected → 13/59 detected
    (22%, up from 1.7%) when lowering the cutoff to 0.2.

  • Specificity stays high (≈95%), correctly classifying most non-stroke patients
    (903 out of 949 remain correctly labeled).

  • Overall accuracy drops slightly (94% → 91%), but balanced accuracy improves
    (≈0.51 → ≈0.59), reflecting better sensitivity–specificity trade-off.

Overall:
Lowering the threshold captures more true stroke cases with only a moderate increase in false positives — a clinically reasonable trade-off for early risk detection.
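
Youden's J (sensitivity + specificity − 1), listed among the evaluation measures, gives a single number for comparing cutoffs. The sketch below computes it from the two confusion matrices shown in these slides. Only these two thresholds are compared, because results for other cutoffs are not reported.

```python
def youden_j(tn, fn, fp, tp):
    """Youden's J statistic: sensitivity + specificity - 1."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return sensitivity + specificity - 1

# Test-set counts read from the two confusion matrices.
matrices = {
    0.5: dict(tn=949, fn=58, fp=0, tp=1),
    0.2: dict(tn=903, fn=46, fp=46, tp=13),
}

j_by_threshold = {t: youden_j(**m) for t, m in matrices.items()}
best_threshold = max(j_by_threshold, key=j_by_threshold.get)

for t, j in j_by_threshold.items():
    print(f"threshold {t}: J = {j:.3f}")
print("higher J at threshold", best_threshold)
```

J rises from about 0.017 at the 0.5 cutoff to about 0.172 at 0.2. A full sweep over candidate thresholds on the predicted probabilities would be needed to find the cutoff that maximises J.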

Conclusion

  • Stroke was a rare outcome (~5–6%), creating strong class imbalance and making detection of positive cases difficult.

  • Key predictors across models were age, hypertension, heart disease, average glucose, and smoking — consistent with established clinical risk factors.

  • Logistic Regression showed good discrimination (AUC ≈ 0.78) and remains a strong interpretable baseline model.

  • Despite good AUC, sensitivity at the default 0.5 threshold was extremely low due to class imbalance.

  • Lowering the threshold to 0.2 improved sensitivity (~22%) while maintaining high specificity (~95%), offering a more clinically reasonable trade-off.

  • Tree-based ensembles (RF, GBM) achieved slightly lower AUC (0.73 and 0.76) than logistic regression (0.78), did not improve sensitivity, and were less interpretable.

  • Accuracy and specificity were high for all models, but misleading, as they mainly reflected the dominance of the non-stroke class.

  • Results show that routine health indicators can meaningfully separate higher- vs lower-risk individuals, but handling class imbalance is critical.

References